Elena Tuzhilina
Oct 14, 2021
http://web.stanford.edu/~elenatuz/courses/stats32-aut2021/
readr

- Absolute path (e.g. /Users/elenatuz/Downloads/datafile.csv).
- Relative path, which depends on the working directory:
  - if the working directory is /Users/elenatuz: ./Downloads/datafile.csv
  - if the working directory is /Users/elenatuz/Downloads: ./datafile.csv or simply datafile.csv
- Check the working directory with getwd().
- Change the working directory with the setwd() function or Session > Set Working Directory > …

- fct_recode(): change factor levels
- fct_collapse() and fct_lump(): reduce the number of factor levels
- fct_infreq(): sort factor levels by how often they appear
- fct_reorder(): sort factor levels by some other variable
- fct_rev(): reverse the order of the factor levels

All these functions are part of the forcats package, which is automatically loaded when you load the tidyverse package.
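Going back to the path rules listed above, here is a minimal sketch of how the same file is reached from two different working directories. The folder created under tempdir() is a hypothetical stand-in for a home directory, and base R's read.csv() is used only to keep the example self-contained (the slides use readr, which resolves relative paths in the same way).

```r
# Hypothetical stand-in for a home folder such as /Users/elenatuz.
home <- file.path(tempdir(), "home_demo")
dir.create(file.path(home, "Downloads"), recursive = TRUE, showWarnings = FALSE)
write.csv(head(mtcars), file.path(home, "Downloads", "datafile.csv"))

old_wd <- getwd()            # remember the current working directory

setwd(home)                  # working directory: the "home" folder
from_home <- read.csv("./Downloads/datafile.csv", row.names = 1)

setwd("Downloads")           # working directory: home/Downloads
from_downloads <- read.csv("datafile.csv", row.names = 1)  # same as "./datafile.csv"

setwd(old_wd)                # restore the original working directory

# Both relative paths pointed at the same file.
print(identical(from_home, from_downloads))
```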
When you open RStudio, you should see something like this:
There should be 3 different windows along with a number of tabs.
- R console (left): runs commands interactively. Type a command and hit the Enter key.
- Environment (top-right): list of objects that we have access to.
- Files (bottom-right): allows you to navigate the directory structure on your computer.
- Plots (bottom-right): any graphical output you make will be displayed here.
- Help (bottom-right): documentation for functionName appears here when you type ?functionName in the console.
- Top-left: nothing so far, but potentially your R script or R Markdown file.
R script: a file with the .R file extension.
To create an R script: click the New File icon in the top-left corner of the window, and click “R Script”.
To execute the code: highlight the code and click Run at the top of the window (or press Cmd-Enter on a Mac, Ctrl-Enter on Windows).
The output will appear in the Console or Plots section.
R Markdown: a file with the .Rmd extension.
To create an R Markdown file: click the New File icon in the top-left corner of the window, and click “R Markdown”.
To insert a new code chunk: press Option-Cmd-I on a Mac or Ctrl-Alt-I on Windows.
To execute the code: click on the green arrow in the top-right corner of the code chunk.
The output will appear below the code chunk.
Code chunk options:

- include = FALSE: prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
- echo = FALSE: prevents code, but not the results, from appearing in the finished file.
- eval = FALSE: code appears in the output but is not run.
- message = FALSE: prevents messages that are generated by code from appearing in the finished file.
- warning = FALSE: prevents warnings that are generated by code from appearing in the finished file.
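The options above are set per chunk. As a sketch, and assuming the knitr package (which R Markdown uses to run chunks), the same options can also be given default values for the whole document; the particular values below are only an illustration.

```r
# Usually placed in a setup chunk near the top of the .Rmd file.
# Individual chunks can still override these defaults.
knitr::opts_chunk$set(
  echo = TRUE,       # show the code
  message = FALSE,   # hide messages
  warning = FALSE    # hide warnings
)
```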
Add your description outside the code chunk.
Can add:
Markdown reference here.
- … .Rmd file in RStudio.
- … .html file in the same folder as the .Rmd file.
- … .html file.
Vectors can be created with the c() function, or using the : shortcut.
## [1] "a" "b" "c"
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 3 5 7 9 11 13 15 17 19
## [1] 1 4 9 16 25 36 49 64 81 100
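The code that produced the outputs above is not shown, so the following is one possible sketch that yields the same four vectors. The variable names are assumptions made for illustration.

```r
letters_vec <- c("a", "b", "c")   # character vector built with c()
one_to_ten  <- 1:10               # the : shortcut for consecutive integers
odds        <- 2 * one_to_ten - 1 # arithmetic is applied element by element
squares     <- one_to_ten^2       # so is exponentiation

print(letters_vec)
print(one_to_ten)
print(odds)
print(squares)
```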
To extract a subset of elements by their indices, put a vector of indices in square brackets
## [1] 1 4 9 16 25 36 49 64 81 100
Continuous chunk (all elements from 3 to 7)
## [1] 9 16 25 36 49
Just some elements (elements 3 and 5)
## [1] 9 25
All except (elements 3 and 5)
## [1] 1 4 16 36 49 64 81 100
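The three kinds of subsetting above can be sketched as follows, assuming the vector is stored under the hypothetical name x.

```r
x <- (1:10)^2          # 1 4 9 16 25 36 49 64 81 100

chunk  <- x[3:7]       # continuous chunk: elements 3 to 7
some   <- x[c(3, 5)]   # just elements 3 and 5
except <- x[-c(3, 5)]  # negative indices drop elements 3 and 5

print(chunk)
print(some)
print(except)
```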
Matrices: two-dimensional analogs of vectors.
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
To extract a subset of elements, put the row numbers, then the column numbers, in square brackets.
One element (1st row, 2nd column)
## [1] 4
One row (3rd row)
## [1] 3 6 9 12
A block (rows 1-3 and columns 1-3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
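A minimal sketch that reproduces the matrix outputs above, assuming the matrix was built by filling the numbers 1 to 12 column by column (the name m is hypothetical).

```r
m <- matrix(1:12, nrow = 3)   # filled column by column: 3 rows, 4 columns

one_element <- m[1, 2]        # 1st row, 2nd column
one_row     <- m[3, ]         # 3rd row, all columns
block       <- m[1:3, 1:3]    # rows 1-3 and columns 1-3

print(m)
print(one_element)
print(one_row)
print(block)
```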
Lists are created with the list() function.
To extract parts of a list, use [[ or $ notation to refer to a specific key-value pair.
## [1] "Honda"
## [1] "Fit" "CR-V" "Odyssey"
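The list behind the outputs above is not shown. The sketch below uses hypothetical key names (make, models) to illustrate both extraction notations.

```r
# Key names are assumptions for illustration.
car <- list(make = "Honda", models = c("Fit", "CR-V", "Odyssey"))

make_value   <- car$make          # $ notation
models_value <- car[["models"]]   # [[ notation, key given as a string

print(make_value)
print(models_value)
```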
Data frames: a data structure for storing datasets.
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
A data frame has a list structure.
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
And a matrix structure.
## [1] 6
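Both views of a data frame can be tried on the built-in mtcars data. This is a sketch: the exact expressions used on the slides are not shown, but the ones below produce the same outputs.

```r
# List structure: each column is a named element.
mpg_values <- mtcars$mpg          # same as mtcars[["mpg"]]

# Matrix structure: index by row, then column.
first_cyl <- mtcars[1, 2]         # 1st row (Mazda RX4), 2nd column (cyl)

print(mpg_values)
print(first_cyl)
```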
Useful commands:
- head() - print top rows
- View() - open the data table in a new tab
- dim(), ncol(), nrow() - check data dimensions
- names() - print variable (column) names
- summary() - check data summary (e.g. mean, max, median)
- ? - check data description (if it is from a package)

Answer these questions about mtcars:

- … mtcars?
- … wt variable?
- … wt?

Add the answer to your R Markdown file (code + comments).
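The inspection commands listed above can be tried on mtcars as in the sketch below. View() and ?mtcars are left out because they open a viewer or a help page rather than returning a value; the variable names are only for illustration.

```r
top_rows  <- head(mtcars, 3)     # first 3 rows (head() shows 6 by default)
data_dim  <- dim(mtcars)         # number of rows, then number of columns
n_rows    <- nrow(mtcars)
n_cols    <- ncol(mtcars)
col_names <- names(mtcars)       # variable (column) names

print(top_rows)
print(data_dim)
print(col_names)
print(summary(mtcars$mpg))       # min, quartiles, median, mean, max of one column
```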
ggplot essential elements of graphics: data, geometries, aesthetics

Geometries: visual elements used for our data.

- geom_point()
- geom_line()
- geom_histogram()
- geom_bar()
- geom_boxplot()

Here we use geom_point().

Aesthetics: define the data columns which affect various aspects of the geom. They depend on the geometry you choose.

- x
- y
- color
- fill
- size
- alpha
- linetype

Here we use three aesthetics:

- x: weight
- y: mpg
- color: cylinders

(shape, size, etc. take on default values, not determined by data)

Other combinations of aesthetics:

- x: weight, y: mpg, size: cylinders, alpha: weight
- x: weight, y: mpg, color: cylinders, shape: cylinders

Note that shape is outside the aesthetics: although it controls visual properties of the plot, it has nothing to do with the data. Aesthetics link your data to the plot's visual properties.
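To make the distinction concrete, here is a sketch in which color is mapped to a data column inside aes() while shape is fixed outside it. The data frame df is an assumption: it stands in for the preprocessed mtcars used on the slides, whose exact preprocessing is not shown.

```r
library(ggplot2)

# Assumed preprocessing: friendlier column names, cylinders as a category.
df <- data.frame(
  mpg       = mtcars$mpg,
  weight    = mtcars$wt,
  cylinders = factor(mtcars$cyl)
)

# color is inside aes(): it varies with the cylinders column.
# shape is outside aes(): every point gets the same fixed shape (17 = triangle).
p <- ggplot() +
  geom_point(data = df,
             mapping = aes(x = weight, y = mpg, color = cylinders),
             shape = 17, size = 2)
p
```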
Why do we care? It helps you to choose the plot type.
What is the distribution of cylinders in my dataset?
ggplot() +
  geom_bar(data = df, mapping = aes(x = cylinders)) +
  ggtitle("Count by cylinders") +
  xlab("No. of cylinders")

Note that y is automatically set to counts.
What is the distribution of miles per gallon in my dataset?
ggplot() +
  geom_histogram(data = df, mapping = aes(x = mpg)) +
  ggtitle("Histogram of miles per gallon")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What is the relationship between mpg and weight?
ggplot() +
  geom_point(data = df, mapping = aes(y = mpg, x = weight), size = 2) +
  ggtitle("Miles per gallon vs. weight")

What is the relationship between mpg and time?
ggplot() +
  geom_line(data = vehicles, mapping = aes(y = `mean highway mpg`, x = year)) +
  ggtitle("Mean highway mpg by year")

Easier to see the trend.
For each value of cylinder, what is the distribution of mpg like?
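The code for this question is missing here. One plausible answer, consistent with the box plot used on the layering slides, is sketched below; the data frame df and the plot title are assumptions.

```r
library(ggplot2)

# Assumed preprocessing of mtcars (the original preprocessing is not shown).
df <- data.frame(
  mpg       = mtcars$mpg,
  cylinders = factor(mtcars$cyl)
)

# One box per cylinder value summarizes the distribution of mpg in that group.
p_box <- ggplot() +
  geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  ggtitle("Miles per gallon by cylinders")
p_box
```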
How often does each pair of cylinder and gear occur in the dataset?
ggplot() +
  geom_tile(data = df, mapping = aes(y = gear, x = cylinders, fill = count)) +
  ggtitle("Distribution of (cylinder, gear)")

We can have more than one layer in a graphic.
Each layer contains (essentially): data, aesthetics, and a geometry.
ggplot2 code

ggplot() +
  geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  geom_point(data = df, mapping = aes(x = cylinders, y = mpg),
             position = "jitter")

When layers share attributes, we only have to type them once:
ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  geom_boxplot() +
  geom_point(position = "jitter")

Argument names can be omitted:

- data = if it is the first argument of ggplot()
- mapping = if:
  - it is the second argument of ggplot()
  - it is the first argument of a geom_xx() function

ggplot(data = df, mapping = aes(y = mpg, x = weight)) +
  geom_point(aes(col = cylinders), size = 2) +
  facet_wrap(~cylinders, ncol = 1)

ggplot(data = df, mapping = aes(y = mpg, x = weight)) +
  geom_point(aes(col = cylinders), size = 2) +
  facet_grid(gear~cylinders, labeller = label_both)

- ggtitle(), xlab(), ylab() - change the title and axis names
- theme_minimal(), theme_classic(), theme_dark() - change the plot background
- scale_x_continuous(), scale_y_continuous() - change the axis range
- scale_color_brewer() - select the color palette

In the mtcars data:
Add the answer to your R Markdown file (code + comments). (Note that in the code above I applied some preprocessing to the data; check ?mtcars to see the description of the original data.)
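As a sketch of how the facet and customization functions above combine in a single plot, the example below rebuilds a data frame similar to df from mtcars. The preprocessing, the axis labels, and the "Set1" palette are assumptions, not the settings used on the slides.

```r
library(ggplot2)

# Assumed preprocessing of mtcars (the original preprocessing is not shown).
df <- data.frame(
  mpg       = mtcars$mpg,
  weight    = mtcars$wt,
  cylinders = factor(mtcars$cyl),
  gear      = factor(mtcars$gear)
)

# data = and mapping = are omitted: they are matched by position.
p_facet <- ggplot(df, aes(x = weight, y = mpg)) +
  geom_point(aes(color = cylinders), size = 2) +
  facet_wrap(~cylinders, ncol = 1) +       # one panel per cylinder value
  ggtitle("Miles per gallon vs. weight") +
  xlab("Weight") +
  ylab("Miles per gallon") +
  scale_color_brewer(palette = "Set1") +   # discrete ColorBrewer palette
  theme_minimal()
p_facet
```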
dplyr

- select(): pick variables/columns by their names
- mutate(): create new variables/columns based on existing ones
- arrange(): reorder rows; use arrange(desc()) to reorder in descending order
- filter(): pick rows by their values
- summarize(): collapse many rows down to a single summary
- group_by(): perform operations at a group level

See lectures 5-6 for details.
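The verbs above are usually chained with the pipe %>%. Below is a small sketch on mtcars; the new column wt_lbs and the mpg > 20 cutoff are arbitrary choices made for illustration.

```r
library(dplyr)

# Row- and column-level verbs chained together.
top_mpg <- mtcars %>%
  select(mpg, cyl, wt) %>%          # keep three columns
  mutate(wt_lbs = wt * 1000) %>%    # wt is recorded in units of 1000 lbs
  filter(mpg > 20) %>%              # keep the more fuel-efficient cars
  arrange(desc(mpg))                # highest mpg first

# Group-level summary: one row per cylinder value.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n = n())

print(head(top_mpg))
print(by_cyl)
```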
For the mtcars data:
- … dplyr statement with at least 3 pipes %>%. Explain what it does.
- … group_by() and summarise(). Explain what it does.

Add the answer to your R Markdown file (code + comments).